18 research outputs found

    Towards Accurate Multi-person Pose Estimation in the Wild

    Full text link
    We propose a method for multi-person detection and 2-D pose estimation that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring. Trained on COCO data alone, our final system achieves an average precision of 0.649 on the COCO test-dev set and 0.643 on the test-standard set, outperforming the winner of the 2016 COCO keypoints challenge and other recent state-of-the-art methods. Further, by using additional in-house labeled data we obtain an even higher average precision of 0.685 on the test-dev set and 0.673 on the test-standard set, more than 5% absolute improvement compared to the previous best performing method on the same dataset. Comment: Paper describing an improved version of the G-RMI entry to the 2016 COCO keypoints challenge (http://image-net.org/challenges/ilsvrc+coco2016). Camera ready version to appear in the Proceedings of CVPR 2017.
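    The abstract names the ingredients of the second stage (dense heatmaps, 2-D offsets, an aggregation step) but not the exact formula. As a minimal sketch of how such outputs can be fused, the snippet below lets every pixel cast a heatmap-weighted vote at the location its offset points to and takes the strongest cell; the function name, array shapes, and voting scheme are assumptions for illustration, not the authors' code.

```python
import numpy as np

def localize_keypoint(heatmap, offset_x, offset_y):
    """Fuse one keypoint's dense heatmap with its per-pixel 2-D offsets.

    heatmap:  (H, W) activation indicating the keypoint lies near each pixel
    offset_x: (H, W) predicted x-displacement from each pixel to the keypoint
    offset_y: (H, W) predicted y-displacement from each pixel to the keypoint
    Returns (x, y, score). Illustrative aggregation only.
    """
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Every pixel votes at the location it points to, weighted by its activation.
    vx = np.clip(np.round(xs + offset_x).astype(int), 0, W - 1)
    vy = np.clip(np.round(ys + offset_y).astype(int), 0, H - 1)
    acc = np.zeros((H, W))
    np.add.at(acc, (vy.ravel(), vx.ravel()), heatmap.ravel())
    y, x = np.unravel_index(acc.argmax(), acc.shape)
    return int(x), int(y), float(acc[y, x])
```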

    Spatial Motion Doodles: Sketching Animation in VR Using Hand Gestures and Laban Motion Analysis

    Get PDF
    We present a method for easily drafting expressive character animation by playing with instrumented rigid objects. We parse the input 6D trajectories (position and orientation over time), called spatial motion doodles, into sequences of actions and convert them into detailed character animations using a dataset of parameterized motion clips which are automatically fitted to the doodles in terms of global trajectory and timing. Moreover, we capture the expressiveness of user manipulation by analyzing Laban effort qualities in the input spatial motion doodles and transferring them to the synthetic motions we generate. We validate the ease of use of our system and the expressiveness of the resulting animations through a series of user studies, showing the interest of our approach for interactive digital storytelling applications dedicated to children and non-expert users, as well as for providing fast drafting tools for animators.
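    The abstract does not detail how Laban effort qualities are measured. As a rough sketch of the idea, the snippet below computes simple kinematic statistics of a doodle's positional track as stand-ins for effort qualities; the function name, the choice of statistics, and their mapping to Time/Weight/Flow are assumptions, not the authors' features.

```python
import numpy as np

def effort_features(positions, dt):
    """Crude kinematic proxies for Laban effort qualities on a 3-D trajectory.

    positions: (T, 3) sampled positions of the hand-held object
    dt:        sampling interval in seconds
    Returns a dict of scalar descriptors (illustrative only).
    """
    vel = np.gradient(positions, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)
    speed = np.linalg.norm(vel, axis=1)
    return {
        "time":   float(np.mean(np.linalg.norm(acc, axis=1))),   # sudden vs. sustained
        "weight": float(np.max(speed)),                          # strong vs. light
        "flow":   float(np.mean(np.linalg.norm(jerk, axis=1))),  # bound vs. free
    }
```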

    Recognizing multimodal entailment

    Get PDF
    How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web. With ever-larger amounts of unstructured data and limited labels, organizing and reconciling information from different sources and modalities is a central challenge in machine learning. This cutting-edge tutorial aims to introduce the multimodal entailment task, which can be useful for detecting semantic alignments when a single modality alone does not suffice for understanding the whole content. Starting with a brief overview of natural language processing, computer vision, structured data and neural graph learning, we lay the foundations for the multimodal sections to follow. We then discuss recent multimodal learning literature covering visual, audio and language streams, and explore case studies focusing on tasks which require fine-grained understanding of visual and linguistic semantics: question answering, veracity and hatred classification. Finally, we introduce a new dataset for recognizing multimodal entailment, exploring it in a hands-on collaborative section. Overall, this tutorial gives an overview of multimodal learning, introduces a multimodal entailment dataset, and encourages future research on the topic.

    Paul Debevec's SIGGRAPH 99 course no. 39 on Image-Based Modeling and Rendering: Video-Based Animation Techniques for Human Motion

    No full text
    scenes, or architectural scenes. Explicit geometric structures are combined with image data. Texture mapping and view morphing are simple examples. We can generate new images from a collection of recorded images. Simple geometry dictates coarse transformations of fine-grained image texture. New views of a scene can be generated by blending between the transformed example textures. This is a trade-off between explicit structure (a collection of views and a geometric model) and implicit example data (the image texture). Such trade-offs apply to other domains as well. The most successful speech production systems (text-to-speech, concatenative speech) follow a similar philosophy. A collection of annotated example sounds is used to create new sounds. A sentence is built from phonemes (explicit structure). To blend the phonemes together, the sound examples are pitch and time warped (implicit data). We will show how this extends to video data and human motion animation. Structure vs Data for Animation: So far, most graphical animation techniques do not exploit such trade-offs between explicit structure and implicit data. Many facial and body animations are generated by 3D volumetric models and physical simulations. Some facial animation systems texture map images onto the geometric model, or morph
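    To make the structure-vs-data trade-off concrete, here is a minimal sketch of the image-based step described above: example textures that have already been warped by the coarse geometry into the target view are cross-blended to form a new image. The function and its weighting scheme are illustrative assumptions, not code from the course notes.

```python
import numpy as np

def blend_views(warped_views, weights):
    """Blend example textures that were already warped into the target view.

    warped_views: list of (H, W, 3) float arrays, each example image after the
                  coarse geometric transform dictated by the scene structure
    weights:      per-view blending weights, e.g. based on how close each
                  example viewpoint is to the target viewpoint
    Illustrative of the trade-off only, not a specific rendering system.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize the blend weights
    out = np.zeros_like(warped_views[0], dtype=float)
    for view, wi in zip(warped_views, w):
        out += wi * view                              # weighted cross-dissolve
    return out
```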

    Performance driven facial animation using blendshape interpolation

    No full text
    This paper describes a method of creating facial animation using a combination of motion capture data and blendshape interpolation. An animator can design a character as usual, but use motion capture data to drive facial animation, rather than animate by hand. The method is effective even when the motion capture actor and the target model have quite different shapes. The process consists of several stages. First, computer vision techniques are used to track the facial features of a talking actress in a video recording. Given the tracking data, our system automatically discovers a compact set of key shapes that model the characteristic motion variations. Next, the facial tracking data is decomposed into a weighted combination of the key shape set. Finally, the user creates corresponding target key shapes for an animated face model. A new facial animation is produced by using the same weights as recovered during facial decomposition, interpolated with the new key shapes created by the user. The resulting facial animation resembles the facial motion in the video recording, while the user has complete control over the appearance of the new face.
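    The decomposition-and-retargeting step lends itself to a short sketch: solve for key-shape weights that reconstruct the tracked frame, then reuse those weights on the artist's target key shapes. The use of non-negative least squares (scipy's nnls) and the matrix layout are assumptions for illustration; the paper may constrain or solve the fit differently.

```python
import numpy as np
from scipy.optimize import nnls

def decompose_and_retarget(tracked, source_keys, target_keys):
    """Decompose a tracked face shape into key-shape weights, then retarget.

    tracked:     (3N,) flattened tracked feature positions for one frame
    source_keys: (3N, K) columns are the K source key shapes
    target_keys: (3M, K) columns are the corresponding artist-made target key shapes
    Returns (weights, retargeted). Illustrative formulation only.
    """
    # Fit: tracked ~= source_keys @ weights, with non-negative weights (assumed).
    weights, _ = nnls(source_keys, tracked)
    # Reuse the recovered weights to interpolate the target key shapes.
    retargeted = target_keys @ weights
    return weights, retargeted
```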

    Finding Pictures of Objects in Large Collections of Images

    Get PDF
    "Retrieving images from very large collections using image content as a key is becoming an important problem. Users prefer to ask for pictures using notions of content that are strongly oriented to the presence of objects, which are quite abstractly defined. Computer programs that implement these queries automatically are desirable but are hard to build be-cause conventional object recognition techniques from computer vision cannot recognize very general objects in very general contexts. This paper describes an approach to object recognition structured around a sequence of increasingly specialized grouping activities that assemble coherent regions of image that can be shown to satisfy increasingly stringent constraints. The constraints that are satisfied provide a form of object classification in quite general contexts. This view of recognition is distinguished by far richer involvement of early visual primitives, including color and texture; the ability to deal with rather general objects in uncontrolled configurations and contexts; and a satisfactory notion of classification. These properties are illustrated with three case studies: one demonstrates the use of descriptions that fuse color and spatial properties; one shows how trees can be de-scribed by fusing texture and geometric properties; and one shows how this view of recognition yields a program that can tell, quite accurately, whether a picture contains naked people or not."published or submitted for publicatio

    COSMOS: Catching Out-of-Context Image Misuse Using Self-Supervised Learning

    No full text
    Despite the recent attention to DeepFakes, one of the most prevalent ways to mislead audiences on social media is the use of unaltered images in a new but false context. We propose a new method that automatically highlights out-of-context image and text pairs, for assisting fact-checkers. Our key insight is to leverage the grounding of images with text to distinguish out-of-context scenarios that cannot be disambiguated with language alone. We propose a self-supervised training strategy where we only need a set of captioned images. At train time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check if both captions correspond to the same object(s) in the image but are semantically different, which allows us to make fairly accurate out-of-context predictions. Our method achieves 85% out-of-context detection accuracy. To facilitate benchmarking of this task, we create a large-scale dataset of 200K images with 450K textual captions from a variety of news websites, blogs, and social media posts.
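    The abstract's test-time rule (same grounded object(s), different meaning) can be written down directly. In the sketch below, the region-overlap score and the caption-similarity score are assumed to come from the trained grounding and text models; the function name and the two thresholds are illustrative assumptions.

```python
def is_out_of_context(grounding_iou, caption_similarity,
                      iou_thresh=0.5, sim_thresh=0.5):
    """Flag an image with two captions as out-of-context (sketch of the test-time check).

    grounding_iou:      overlap (IoU) between the image object(s) selected for
                        caption 1 and those selected for caption 2
    caption_similarity: semantic similarity between the two captions, in [0, 1]
    Returns True when both captions ground to the same object(s) but disagree
    semantically. Thresholds are assumptions, not the paper's values.
    """
    same_objects = grounding_iou >= iou_thresh
    different_meaning = caption_similarity < sim_thresh
    return same_objects and different_meaning
```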